Proposal of Smith-Waterman algorithm on FPGA to accelerate the forward and backtracking steps

In bioinformatics, alignment is an essential technique for finding similarities between biological sequences. Usually, the alignment is performed with the Smith-Waterman (SW) algorithm, a well-known sequence alignment technique of high-level precision based on dynamic programming. However, given the massive data volume in biological databases and their continuous exponential increase, high-speed data processing is necessary. Therefore, this work proposes a parallel hardware design for the SW algorithm with a systolic array structure to accelerate the forward and backtracking steps. For this purpose, the architecture calculates and stores the paths in the forward stage for pre-organizing the alignment, which reduces the complexity of the backtracking stage. The backtracking starts from the maximum score position in the matrix and generates the optimal SW sequence alignment path. The architecture was validated on Field-Programmable Gate Array (FPGA), and synthesis analyses have shown that the proposed design reaches up to 79.5 Giga Cell Updates per Second (GCPUS).


Introduction
In Bioinformatic, the analysis can be divided into three parts called primary, secondary, and tertiary analysis [1,2]. The primary analysis is responsible for generating genomic data information from biological material. In the primary analysis, the sequencing machines create raw genomic data (or raw data). The raw data is composed of several genome reads. The secondary analysis involves reads alignment and trimming based on quality, and at the end of this step, a whole genomic is created. Finally, tertiary analysis can be characterized as interpreting results and extracting meaningful information from the data. In this last step, many algorithms and techniques can be applied. Also, many applications are created from these analyses. The tertiary analysis covers various applications, from genome characterization to a vaccine or drug treatment creation [2]. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 To overcome the low-speed processing bottleneck and maintain the optimal alignment of deterministic algorithms, parallel hardware implementations for the SW algorithm have been proposed in the literature. The main platforms used are Field Programmable Gate Arrays (FPGAs), Central Processing Units (CPUs), and GPUs. FPGAs are widely known for their flexibility for parallelization and low-power consumption. An FPGA is a matrix of logic blocks that allows designing different circuits, such as processors, logic circuits, and even algorithm development [31]. FPGA platforms can be categorized as third generation computational infrastructure in bioinformatics, as it is a CHA. Also, the logical blocks within the FPGA are independent, allowing operations to be carried out in parallel and only one clock cycle, unlike CPUs that operate sequentially based on instructions and GPUs that require constant access to memories.
GPUs are well-known for their high degree of parallelism and computing intensity. However, high-performance GPUs have a high cost, and, broadly speaking, GPUs have significant computing latency, and low energy efficiency compared to custom hardware, but with more energy efficiency than CPUs, as seen in the results [22,23,40]. The high computing latency is due to the high number of cores and low cache memory to control these cores. In contrast to GPUs, FPGAs are customizable according to the user's needs, achieving better computing performance and lower latency [15,54,55]. However, FPGA hardware development is usually complex and takes a long time.
Unlike the conventional platforms previously mentioned, the SW algorithm has also been implemented on ResCAMs, as can be seen in [22]. ResCAMs is a storage accelerator system that allows millions of processing units (PUs) to be deployed over multiple silicon arrays. In [22], the implementation was used to compare the homologous chromosomes between humans (GRCh37) and chimpanzees (panTro4), and the only SW step performed was the building of the score matrix. As a result, their proposal achieved 5, 300GCUPS, a 4.8× speedup over the GPU performance. Besides, it also had a 1.7× better energy efficiency compared to an FPGA implementation.
In [45], a hardware/software co-design was implemented on FPGA to reduce the execution time of short-sequence alignment during genome sequence analysis. The analysis was performed using the Shouji method, a highly parallel and accurate pre-alignment filter that reduces the need for computationally-costly dynamic programming algorithms. The FPGA was used to boost the algorithm's performance. As a result, integrating the Shouji and aligner methods reduced the alignment total execution time by up to 18.8×.
In [48], a systolic-array-based architecture is presented as a DNA sequencer, using the SW algorithm affine gap penalty score. According to [48], the SW algorithm was implemented on a Xilinx Virtex-6 FPGA, reaching 465 Giga Cell Update per Second (GCUPS), and reducing the area occupation by 90% compared to other architectures.
In [50], an FPGA runtime accelerator is presented to align data information sequences. The authors' implementation is based on the seed-and-extend model of Bowtie2, achieving a similar alignment rate while mapping reads by �2× faster. Meanwhile, it is proposed by [46] an FPGA implementation of the SW algorithm to replace the GPU used in [40], called SWIFOLD. The FPGA implementation is based on OpenGL and utilized for long DNA sequences alignment. The SWIFOLD approach for accelerating the SW alignment reached an average of 125 GCUPS.
In [41], an ASIC design for traceback recording with penalty scoring is proposed. The design is implemented on a TSMC 40nm technology, and the aligner strategy can speed up pairwise alignment by 71× compared to the CPU.
Meantime, in [47] is presented an FPGA approach to meet the alignment operation processing requirements. The paper introduces a register-file concept to reduce run-time storage, and it does not require any sorting or comparison operations to prepare for the final sequence alignment. As a result, a 128 GCUPS performance was achieved using 256 PEs.
In [56], an FPGA hardware implementation of SW and NW algorithms is presented. The performance and area occupation is evaluated for different hardware designs. Besides, a convolution neural network model is introduced to implement the NW algorithm, achieving 98.3% accuracy.
In [23], a heterogeneous FPGA architecture for sequence alignment is proposed. Unlike most of the works in the literature, their implementation aims to accelerate the entire SW algorithm with the backtracking process. For this purpose, the architecture can process long strings of data based on parallelism and partitioning strategies; and the backtracking process was performed by dividing the equal parts of the similarity matrix, while the search started from the lower right sub-matrix. The tests were performed for 512 Processing Elements (PEs), reaching 76.8GCUPS at 150MHz and 105.9GCUPS (with external memory) for 200MHz. As a result, a speedup of 3.6× to 25.2× was achieved regarding other SW designs implemented on FPGA and GPUs. Besides, it reached a 26% reduction in power consumption compared to the GPU implementation.
Similarly, more FPGA approaches using systolic arrays for the NW and SW with backtracking sequencing techniques have been proposed, such as [39,57]. In [39], a VHDL SW implementation, using Dynamic Programming (DP) with approximation correspondences for two different strategies, was proposed. It achieved 23.5GCUPS with speedups between 150× to 400× compared to a 2004-era PC. Meanwhile, in [57], the implementation was based on PEs to perform elementary calculations and a diagonally backtracking search, also developed in VHDL. Comparisons were made with the linear and affine strategies, achieving 10.5GCUPS.
In [44], the SW forward and backtracking processes were implemented in an FPGA. The Qnet structure was adopted for communicating with the FPGA, reaching 25.6GCUPS, a speedup of 300× compared to a desktop computer.
Therefore, it can be noted from the literature that the key points for a high-performance SW implementation on FPGA are the operating frequency and number of PEs, which in turn are associated with the hardware capacity and design critical path. Thus, we present an FPGA implementation for the SW algorithm using systolic arrays, as in [23,39,44,57].
Our approach performs both the forward and backtracking stages of the algorithm. Unlike the approaches in the literature, we obtain the alignment path distances during the forward stage processing and the maximum score, reducing the complexity of the backtracking stage processing. Memories are used to propagate the distances and maximum score, allowing the backtracking step to follow the path directly. Thus, our architecture achieves good performance (short critical path), reduced memory usage and, theoretically, high scalability, and prevents memory access overlap latency, even implementing the two stages of the SW algorithm.
It is essential to mention that our proposed hardware architecture can perform the alignment of any sequence length. However, the resources available in the target FPGA are a limiting factor for the size of the score matrix, as shown in [23,47,58]. Nonetheless, we provide a proofof-concept and an actual implementation on the Virtex-6 FPGA using a synthetic dataset.

Smith-Waterman algorithm
Smith and Waterman originally proposed the SW algorithm in 1981 to performs local sequence alignment of nucleotides and proteins in the biological field [12]. The sequence alignment of the SW algorithm includes the forward and backtracking stages, which are performed by the calculation results of the alignment similarity score. Besides, the alignment is performed based on two input sequences called query sequence, q, and dataset sequence s. The query sequence can be expressed by where q j is the j-th nucleotide or amino-acid protein and N is the length of the query sequence. The dataset sequence can be expressed by where s i is the i-th nucleotide or amino-acid protein, and M is the dataset sequence length. Therefore, the SW algorithm is calculated iteratively for two dimensions, and it has a computational complexity of O(M × N).
The forward stage calculates the scoring matrix, H, where H is a two-dimensional array that can only take values greater than or equal to 0 (i.e., H 2 N 2 ). This matrix is generated by comparing the elements of the sequences q and s. Usually, H is generated using DP, and it is initialized with zeros in the first row and column. Subsequently, the DP process is performed to calculate the sequence scores. Based in works presented in [23,36,59], the recurrence relationship can be defined as H M;N ¼ Hði; jÞ ¼ maxf0; Eði; jÞ; Fði; jÞ; Hði À 1; j À 1Þ þ Pðs i ; q j Þg Eði; jÞ ¼ maxfHði; j À 1Þ þ r; Eði; j À 1Þ þ sg Fði; jÞ ¼ maxfHði À 1; jÞ þ r; Fði À 1; jÞ þ sg where Hði; jÞfði; jÞ 2 N j 1 � i � M; 1 � j � Ng, P is the score matrix used for obtaining the similarity score between s i and q j , E and F are two assisted matrices when calculating matrix H, ρ is the gap opening penalty and σ is gap extension penalty. In the particular case of ρ = σ, a linear gap penalty model is obtained, opening and extending a gap with the cost γ. P is also called a substitution matrix, where the simplest version is when the diagonal receives the match value and the rest of the matrix has a mismatch value. When performing all element calculations, this expression is the H M, N matrix. Therefore, H(i, j) is the maximum alignment score of two sub-sequences s and q. The initialization condition is The maximum score value of H(i, j) in the forward stage is the last sequence that will be aligned. To determine the relationship, the previous neighborhood values of the analyzed element are required, i.e., the values on the diagonal, horizontal, and vertical positions, as illustrated in Fig 1. As can be observed, the score of w can be found based on its neighborhood (x, y, v), which is H(i − 1, j − 1), H(i − 1, j), H(i, j − 1), respectively. This windowing step occurs throughout the process of determining all scores in H. Fig 1, the neighborhood values x, y and v, must necessarily be known to determine the value of w (i.e., H(i, j)). For this purpose, those values are defined based on the sequences q and s. Thereby, the w score is determined as

As shown in
where γ, α, and β represent the linear gap, a match, and a mismatch, respectively. A gap is a penalty that causes an empty element in the sequence (represented by a dash symbol), while the other sequence continues. It can result from the query or database sequence. The Eq 5 is equivalent the Eq 3, where x+ (α_β) = H(i − 1, j − 1)+ P(s i , q j ), y+ γ = F(i, j) and v+ γ = E(i, j). Finally, when fully populated, the H matrix contains the score and path information.
The backtracking stage starts after determining all the scores in the H matrix, i.e., calculating the score of all cells H(M, N). Hence, the backtracking begins at the cell with the highest value in the H matrix (maximum score) and trace-back the next position based on the highest neighborhood value, according to Eq 5, which can be on the diagonal, horizontal, or vertical direction. This is an iterative process that repeats until it reaches the limit value, usually set to a score of 0. Also, a directional flag indicates the path. Finally, the backtracking path determines the best local alignment. The diagonal direction points to a match in the alignment, while the horizontal and vertical directions indicate gaps which are represented by dashes in the s and q sequences, respectively.

Implementation description
The hardware architecture for the SW algorithm proposed in this work was developed using systolic arrays to input two DNA sequences and increase the processing speed of the local sequence alignment. An overview of the systolic array structure of the proposal for N PEs is shown in Fig 2. Besides, each PE is divided into 3 modules. These modules are the forward stage, the storage process, and the backtracking stage, as seen in Section 2. Each module is illustrated in blue, green, and yellow, respectively. The forward stage has its module named as Matrix Score Module (MSM), the storage process module is called as Memory Module (MM), and the backtracking stage has its module as backtracking stage (BS).
The labeled signals shown in Fig 2 are generated outside the modules. Meanwhile, the nonlabeled ones are generated by computations inside the modules and detailed throughout this Section. The sequences q and s, defined according to Eqs 1 and 2, are external discrete signals used as inputs of the SW algorithm. Furthermore, each signal in the sequences represents one of the four DNA nucleotides, i.e., A, G, T, or C (also accepting twenty distinct levels referring to amino acids or another set of sequences). The design proposed supports any sequence set as in any SW algorithm, but for an efficient alignment, it is necessary to adopt a suitable scoring matrix that models each possible symbol's frequencies that can occur in the sequences.
Initially, the circuit starts when the MSM modules propagate the q and s signals. As seen in Fig 2, each k-th element of the q and s sequences are shifted to each MSM output to shorten and stabilize the critical path, as well as allowing the computation of scores synchronously, preserving the systolic array structure. Afterward, the MSM computes the score according to Eq 3, and propagates the sequence elements to the next MSM; also, the computed results are sent to the respective MM in their order of entry. During this process, the MM operates exclusively in writing mode while the process has not reached the last computation between the two sequences.
The forward stage is completed after fully computing the scores of the H matrix. Also, the last MSM enables the backtracking process. Consequently, the MM switches to the read mode, and the BS reads the data computed by its respective MSM. The alignment starts from the calculations performed in the MSM. Then, from the respective defined PE in the forward stage, the process starts and ends according to the definitions of the SW algorithm. Fig 3 shows the block design that represents each PE of the systolic array, with a detailed description of the signals between the modules within one PE. As can be observed, besides the two input sequences to be compared, q and s, the MSM also receives an enable signal, e n. After computing the score between each k-th element of the two sequences (i.e., an element of the H matrix), the MSM outputs to the next PE the following signals: the calculated score, Sc j ; the maximum score, MaxVal, and its position, AddrRAM i^j ; the PE index; along with the input signals q, s, and en, shifted in time. In addition, the MSM also outputs signals to the MM, which are the calculated path direction, Direction, and the storage address of that path wAddrDir.
Subsequently, after fully populating the H matrix and, consequently, the D matrix, the forward stage is finished enabling the Traceback signal, which in turn begins the BS. Firstly, the BS sets signal BT Start to 1, indicating the start of the backtracking process. Therefore, the mRAM i^j are propagated back until it reaches the BS with maximum score, which is identified by the signal index. From this location match, the btcontrol signal is changed to allow the reading of the memory by MM. Thus, the BS receives the path value from the MM at signal d j when sending the memory address rAddrDir j signal. The d j value allows the BS to calculate the next requested address and propagate it to the next module through the path(j) signal, representing the memory address of the request path in MM. Lastly, the alignment value enters val-Dir, and the process continues until it reaches the complete alignment. All modules are detailed in the following subsections. All signals present in this Section are shown in Table 1.

Forward approach
Firstly, based on the principles "divide and conquer" for solving computational problems, we propose a matrix used to store only the values of the recursive path, called the D matrix. The D matrix is not widely used in the SW literature. However, it is important to achieve a solution at lower-level programming. Besides, a matrix with two different types of information, such as the H matrix, increases the hardware design complexity. Matrix D needs to store only 4 levels of values which are: 0, 1, 2 and 3. Each element of the matrix D needs 2 bits to be expressed, delivering a more economical storage process compared to H, which can certainly need more than 2 bits to represent each element.
As previously mentioned, the alignment process is performed based on the query and dataset input sequences, q and s, respectively. Also, there can have different sizes, represented by N and M, which define the size of the matrices H and D, respectively. The Matrix Score Module (MSM) calculates the scores and distances in columns of matrices H and D in parallel.
The systolic array structure developed for the matrices is composed of N PEs. Therefore, for each j-th element in q, there is a j-th PE. It is based on dividing the construction of the H score matrix expressed by and finding the best path in which the D matrix returns the correct sequence alignment, which in turn is equivalent to the directional flags that determined the alignment path. Moreover, for each PE j (which represents a column of the matrix H) there is i-th s(i) that varies from 0 to M − 1, according to the following : ð7Þ The number of MSMs submodules corresponds to the number of elements in q, i.e., fj 2 N j 0 � j < Ng, as can be observed in Fig 4. Therefore, H is formed by N columns, according to Eq 6. Besides, the MSM also calculates the path, the maximum score value and its position, which are subsequently stored in the Memory Module (MM).
The SW algorithm in this work is initialized by the en(k) signal, which enables the memory components in the MSM and MM modules to allocate the two sequences q(k) and s(k). The en(k) is a sequence of pulses of value 1 with size equal to the s sequence. Thus, the sequences  Each k-th q element is compared to all elements in s, iteratively, i.e., the s is traversing, going from the first PE to the last one. If the values are equal, a value from the Match constant is propagated; otherwise, the value of Mismatch is propagated. Match corresponds to a reward for similarity, while Mismatch is a penalty for inequality between values. Afterward, the addition block sum the values according to g j ðiÞ ¼ g jÀ 1 ði À 1Þ þ a q j ¼ sðiÞ where α and β are arbitrary values that correspond to the match value and mismatch values, respectively. Subsequently, the score value, Sc j−1 (i − 1), and correspondence value, α^β, are added to define a portion of g j (i). The Sc j −1 (i − 1) value is equivalent to the H(i − 1)(j − 1) value (i.e., g j−1 (i − 1)). The values of Sc −j (i), MaxValue(−1), AddrRAM i^j (−1) and index(−1) are initialized with 0. At the same time, the Sc j−1 (i), which is the score value of the previous block, it is received and operated with the value of Gap. In addition, the value of the scoring operation of this block in the previous time, Sc j (i−1), is also operated with the Gap. Thus completing the computation of g j (i) that can be seen in the Eqs 5 and 9.  calculated scores is carried out based on Eq 5 as follows g j ðiÞ ¼ max 0 g jÀ 1 ði À 1Þ þ a q j ¼ sðiÞ g jÀ 1 ði À 1Þ þ b q j 6 ¼ sðiÞ where γ is an arbitrary value that represents the chosen linear gap value. This expression is equivalent to Eq 3. The output of the pink blocks, called opr, are propagated to the next submodule for choosing the maximum score and distance path, as shown in Fig 5. This submodule is built with a set of multiplexers and relational circuits that can find the maximum score value with the coded distance of the path by comparing the opr signals, as seen in Fig 6. Selecting path distances is based on a simple encoding of three levels representing the alignment action to be adopted: 2, 1, and 3. Therefore, the levels 2, 1, and 3 represent a match, a gap in the target sequence q and s, respectively, as described in Section 2. The encoding process of directions is performed in the forward step, as illustrated in Fig 6. During this process, the same signals used to calculate the H score matrix are needed, i.e., the opr j−1 (i−1), opr j−1 (i) and opr j (i − 1), as seen in   1, j) and v = H(i, j − 1) are known values, while w is a score to be computed. Starting from w = H(i, j) as the observed cell for determining a generic path and x, y and v as the neighborhood. An integer value is associated with the d j corresponding to the address of w, according to the maximum value determined in the neighborhood, these values are assigned according to the expression where 1, 2 and 3 is the vertical, diagonal, and horizontal paths, respectively. The Eq 10 is equivalent to the circuit implementation illustrated in the Fig 6, where (x+ (α_β)) = opr j−1 (i−1), (y−γ) = opr j (i−1) and (v−γ) = opr j−1 (i). Besides, (α_β) = α for a match and (α_β) = β for a mismatch.   After the process of selection the score and direction, it has the choice of the maximum score based on a logic of multiplexers and relational blocks. There is a counter, called cntR, to determines the number of times that the selection of the score and direction is carried out, i.e., the H matrix row that the process is on. This is necessary to determine the AddrRAM i address. At the beginning of MSM processing, index(j − 1) is added to 1, just once for each MSM, becoming index(j) and determining the address of this MSM. For the determination of Maxval, it is seen whether the previous value is less than the current computed score value, then the calculated current score value becomes the Maxval, AddrRAM j = index(j) and respective row process value is AddrRAM i . It is noted the AddrRAM i^j signal are corresponding to the location of the maximum score value.
In parallel with the process of determining the maximum score value, there is the process of storing the directions. Thus, the output Direction of the submodule is prepared in set with the value wAddrDir, which comes from the H matrix row calculated at that moment, allowing to write in order in RAM memory according to the respective positions of H matrix (i.e., same position of D matrix).
Finally, according to the systolic structure, after the MSM processing is over, the signals are parallelly sent to the next MSM. Thereupon, q(k), s(k), and en(k) are shifted in time, that is, q(k − 1), s(k − 1), and en(k − 1), to match the calculation structure of the H matrix, as seen in the Fig 4. Besides, the calculated signals Sc(i), MaxVal, AddrRAM i^j and index are also propagated to the next MSM to preserve the scores calculating structure. This process repeats until the last element of s is calculated with the last element of q; a counter in is used to determine that moment since the values of the sequences are previously informed to all PEs. The forward stage finishes with the calculation of the last element of the matrix, i.e., H(M − 1) (N − 1). Consequently, the signal Traceback is enabled, indicating the end of the process in all MSM, and the addresses AddrRAM i^j corresponding to the maximum score value is sent to the next step (i.e., backtracking process).
Algorithm 1 presents the SW pseudo-code for forward stage and storage process structures. The Algorithm 1, is prepared to perform the calculation of scores and storage of matrices H and D. The input is the signals q and s, which is Eqs 1 and 2, respectively. The first loop, in the Algorithm 1, represents each N element used, as seen in

Memory Module (MM)
The MM communicates with both the MSM and the BS, as shown in Fig 7. During the forward stage, the data regarding the distance values are written to the MM. Meanwhile, during the BS, the memory addresses to align the sequences are fetched from the MM. The size of each memory is defined by the size of the s sequence; also, there is a flag to indicate that the memory is in write mode while computing the H matrix and, subsequently, in fetch mode, in the backtracking process.
The MM consists of Random Access Memories (RAMs) used to store the path directions, Direction, obtained in the MSM that is thereafter needed in the BS module. Hence, the RAMs are in write mode throughout the forward stage and reading mode during backtracking. The RAM input ports are the address and data busses and write enable mode. Besides, the memory size of each memory is defined based on the size of the sequence s, which in turn, the amount of RAM memories is equal to the number of PEs in the systolic array.
The enable signal, en, is used as write enable for each RAM in the MM. Therefore, en = 1 defines the write mode, while en = 0 the read mode. In addition, the btcontrol signal selects which module controls the RAM address bus. Hence, for btcontrol = 0 the memory addresses are defined by the MSM module through wAddrDir signal, while btcontrol = 1 selects the BS module to define the addresses via rAddrDir signal.
Thus, in write mode (en = 1 and btcontrol = 0) the wAddrDir signal defines the address of the RAMs where the Direction value is stored by the MSM. Subsequently, after the H matrix is fully calculated, the Traceback is enabled to indicate the end of the forward stage, and the MM goes into reading mode (en = 0 and btcontrol = 1). Accordingly, the rAddrDir signal defines the address space the BS fetches the data corresponding to the value reported by the traceback.

Backtracking approach
The backtracking process starts when the Traceback signal is enabled in the MSM by counters that determine the last PE and the last processed element of s, as described in forward stage. As previously mentioned in subsection 3.1, the MSM propagates to the MM the maximum score address that is used as the starting point for alignment, as shown in Fig 8. Meantime, the Fig 9 details the submodules used to create each BS module. The submodules in green are circuits for controlling and synchronizing all signals during the module operation, while the blue submodule performs the alignment path described in this section.
Firstly, after Traceback is enabled, the BTStart signal is enabled, and the addresses of the maximum score element, Addr RAMi (N − 1) and Addr RAMj (N − 1), are sent to the respective BS. Also, the values of Addr RAMi (N − 1) and Addr RAMji (N − 1) are assigned to mRAMi(N − 1) and  (N − 1), respectively, by the BT Enable submodule. It is important to emphasize that if the mRAMj(N − 1) value (i.e., Addr RAMj (N − 1)) is not already in the BS PE, it will trace-back by checking the Memory Index submodule. This process happens until it reaches the PE corresponding to the maximum score location. Afterward, the Memory Index submodule assigns

PLOS ONE
Proposal of Smith-Waterman algorithm on FPGA to accelerate the forward and backtracking steps mRAM i value to rAddrDir to read the memories in the MM, which in turn, returns the d(i) value to the Direction Process submodule, as can be seen in Fig 9. Secondly, the alignment process starts. The circuits used to build the alignment submodule are shown in Fig 10. As can be observed, the input d j (i) is used as the multiplexer selector to perform the Eq 10. Therefore, for d j (i) = 3, BS remains in the same memory position and moves back one BS module, i.e., horizontal displacement. While for d j (i) = 1, only the memory position decreases by 1, and BS is verified by the Direction Process and Continue Processing submodules (i.e., vertical displacement). Meanwhile, for d j (i) = 2, the memory position also decreases by 1, and it moves to the previous module with the displacement in the memory position. The circuit after the first multiplexer prevents negative addresses in the memory.
Given that the path to align the first element is found, the Alignment Block submodule receives the rAddrDir j and d j (i) signals to define the path to be followed by the next BS, as seen in Fig 9. Initially, a logical circuit enables the BT Start and Direction Process submodules to propagate those signals to the Alignment Block. The Direction Process and Continue Processing submodules carry out checks to define which BS module is active, that is, for d j (i) = 1, the data processing is held in the current BS module, and for d j (i)6 ¼1, the signal BT Next is enabled, indicating the end of data processing in the current PE to start in the next one.
After finding the module for the maximum score, the mRAM i and mRAM j signs finish their function. Thus, from the determination of the BS with the maximum score, the path(j) sign is used as a guide for locating the alignment of each module. Then, the data in MM is requested and the d j value is returned for verification and establishment of alignment. The verification and establishment of the alignment path is done by the Memory Index, Direction Process, and Continue Processing submodules. Decisions related to d j value are made in Alignment Block submodule, as illustrated in Fig 10. Finally, the Finish Processing and Continue Processing submodules finish the data processing in the module. Thereby, the valDir output of each submodule is used to construct the alignment path, along with the maximum score position values. The trace-back continues until it reaches PE 0 or finds a path direction with a value of 0.
Algorithm 2 presents the SW pseudo-code for backtracking stage this proposal. The backtracking stage, algorithm 2, is ready to perform the alignment in a list using the path informed in D, starting from the positions of the maximum score, as seen in this Section. Inputs for this step are provided by Algorithm 1. The loop for this step represents all backtracking stage modules from N − 1 to 0. The conditional structure of Algorithm 2 is the representation of submodule Alignment Block, Fig 10, which allows it to trace-back. And the return of the alignment path is storing the data, valDir, in RAM memory.

Results and discussion
This section presents the synthesis results for the architecture described in the previous section and analyses it regarding the following key points: critical path, operation frequency, number of PEs, and performance. The performance measures the time to calculate an element of the scoring matrix.
The development of the algorithm was carried out using the development platform provided by the FPGA manufacturer, in this case, Xilinx [60]. This platform allows the user to develop circuits using the block diagram strategy instead of VHDL or Verilog. The architecture was deployed on the FPGA Virtex-6 XC6VLX240T and compared to state-of-the-art works. Usually, hardware implementations of the SW algorithm in the literature were implemented only the forward stage or both the forward and backtracking stages. In our proposal, both stages were implemented.
The performance for hardware implementations of the SW algorithm is usually measured in Giga Cell Update Per Second (GCPUS), which in turn is defined as GCUPS ¼ number of cells total processingtime � 10 9 ; ð11Þ in which a cell can be a matrix element to be computed. This metric can also be described based on the clock frequency, that is, The latter equation is often used to compare the systolic array efficiency. Since the number of cells is equivalent to the number of PEs, and the clock frequency defines the operating frequency, it is unnecessary to measure the total runtime of the algorithm.

Hardware architecture validation
To validate the architecture proposed in this work, the sequences q and s were randomly generated and varying the match, mismatch, and linear gap values. Initially, the analysis was carried out for 8 PEs and by varying the size of the sequences q and s from 8 to 32. The number of PEs also varied according to the size of q. Our architecture works with sequences of varying lengths, only requiring that the length of s is greater than or equal to the q length. Firstly, the correctness of the matrices H and D was verified by monitoring the MSM outputs, such as Sc and Direction, as described in Section 2. Secondly, it was verified if the D matrix elements were stored in the correct memory positions in the MM. Lastly, the operation of the BS modules was also verified by monitoring the path(j + 1) bus and the Memory Index submodule.
Following, the Alignment Block and Direction Process are observed to check if the memory accesses are in accordance with the path(j + 1) value, that is, according to Eq 10. Also, the Finish Processing and Continue Processing submodules are monitored to verify the values propagated for a match (2), horizontal gap (3), and vertical gap (1).
The data bit-width was defined by the maximum size of the input sequences, limited by FPGA memory capacity. Hence, the input sequence bit-width was set to 3 while constants were defined according to its value. Besides, the bit-width for the MSM buses that perform mathematical operations was defined as log total − PEs × α. Meanwhile, the sequence counters for s is log s − size . Fig 11 shows the architecture deployed and running on the Virtex-6 FPGA. The host computer (i7-3632QM CPU and 8GB of RAM) was used to plot the results and compare them to a software implementation presented in [61], as shown in Fig 12. In the Fig 12, it can be seen that the y axis refers to the s sequence, while the x axis refers to the q sequence. To increase the resolution of the image, only the parts of the sequences that are aligned are used, where the position at which the alignment starts and the maximum score value are shown in the title of the illustration. The value of Row refers to the position in the s, whereas Column is related to the element of the q. The amount of sequence alignment performed is represented by Number of Alignments.
The architecture parameters for the demo were set to match = 5, mismatch = −5, gap = 1, and 128 PEs. Hence, the size of the sequence q is also 128. Meanwhile, the size of the sequence s was set to 8,192, resulting in a total of 1, 048, 576 calculated cells. Sequence q is loaded into memory at each iteration, where it can vary between 4 different 128 nucleotide sequences in the demonstration. The demo is available at [62], and the implementation source code is available at [63].
The SW architecture was developed using the Xilinx System Generator on Matlab, and the traffic of data between the host PC and the FPGA was accomplished via Ethernet protocol. Moreover, we added on the FPGA a buffer to store data and developed a manager circuit to control the data flow. Therefore, the q and s sequences were transferred to the FPGA (via Ethernet protocol) and stored in the buffer, and, subsequently, fed into the SW architecture to perform the alignment by the manager circuit.

Synthesis analysis
Analysis of the synthesis results for the SW hardware implementation were carried out for two FPGAs: Virtex-6 XC6VLX240T and Virtex-7 XC7VX485T. Table 2 presents the hardware area occupation and frequency for a different number of PEs. The size of the input sequences were defined according to the number of PEs.
The critical path of the design was �8.34ns and �6.44ns for the Virtex-6 and Virtex-7, respectively. Therefore, the maximum clock frequency was 120MHz for the Virtex-6 and 155MHz for the Virtex-7. Regarding the FPGA area occupation, increasing the number of PEs also increases the hardware resources used. For 512 PEs in the Virtex-6, a total of 68% of the Slice Look-Up Tables (LUTs) were used in contrast to only 7% for 64 PEs. Concerning the frequency, a slight decrease is observed as the number of PE increases due to an increase in the critical path. Concerning the Virtex-7, there are unused FPGA resources as less than 35% of Slices LUTs were used. Therefore, it can be used to increase the number of PEs and, thus, the performance. Note that, increasing the number of PEs and, consequently, the size of the sequence q, the number of parallel computations will also increase, thus, improving the performance. Therefore, according to the resources available in the target hardware, our architecture can operate with a number significantly bigger than 512 PEs.

Comparison with other works
Comparisons with state-of-the-art works were also performed. The performance of systolic array-based implementations increases with the number of PEs. Hence, the comparisons were carried out for the maximum number of PE in each proposal. We compared our design to the most relevant and recent works similar to our proposal, i.e., the SW algorithm has to be implemented using a systolic-array structure, deploy the backtracking step, and provide the parameters concerning processing time and area occupation. Given that, we discuss a direct  comparison to the architecture presented in [23,41,47]. The remaining works shown in Table 3 illustrate general results of other FPGA implementations of the SW. The works presented in Table 3 are available in [23]. The second column indicates whether the backtracking stage was also developed on FPGA or only the forward. Meantime, the third to fifth columns present the number of PEs, operating frequency, and performance, respectively. The performance was obtained according to Eq 12. As can be seen, our approach and the one proposed by [23] were the only ones to implement a high number of PEs. However, in [23] only the backtracking path was deployed on the FPGA, and a submatrix structure is used to load the path chosen for alignment. Meanwhile, our architecture relies on a memory storage structure and the definition of the maximum score to align the sequences. The architecture proposed by [47] achieved the best performance, as can be seen in Table 3. However, it uses a register-file concept instead of systolic-array. Therefore, due to the similarity of hardware techniques used to deploy the SW algorithm, we discuss a result comparison with [23], which achieved the second-best performance.
Furthermore, a comparison with [23] was also carried out regarding the FPGA area occupation, and it is presented in Table 4. The second and third columns present the FPGA and the number of PEs used, respectively. Meanwhile, the third and fourth columns present the slices and memory blocks occupied, and the fifth column the operating frequency.
As shown in Table 4, for the same number of PEs, our architecture occupied 35, 286 slices and 0 BRAMs in contrast to 57, 870 slices and 896 block RAMs (28 Mbits memory) in [23]. Also, the total area occupation was higher than 60%, compared to 46% on ours, due to the substitution matrix. Therefore, our proposal has high scalability due to the low resource usage (can reach up to 1, 024 PEs for the XC7VX485T). Besides, our implementation proposal can be implemented in smaller FPGAs, such as the Virtex-6 XC6VLX240T, with a reasonable nucleotide sequence.
Regarding the operation frequency, our proposal can reach up to 155 MHz. So, it is observed that the proposals with the best performances have a similar structure, even with different approaches to the solution. Our proposal and [23] achieving the same performance for the frequency of 150 MHz. The proposal with the highest frequency achieved was that of [47] reaching 500 MHz with Virtex-5 XC5VLX50T FPGA. Therefore, our work uses fewer hardware resources to perform the alignment process due to the chosen backtracking approach. As the backtracking stage results in high computational complexity, we simplified the process using the path mapping through the maximum value in D and H, resulting in linear computational complexity. On the other hand, the architecture proposed by [23] uses considerably more memory resources due to data partitioning and prefetching for the backtracking step. Despite both works achieving similar performance due to the systolic array, there are significant differences in the alignment approach chosen for the FPGA implementation.
The SW proposed by [41] is-to our knowledge-the most recent work on sequence alignment with the SW algorithm that also embeds the backtracking process in custom hardware. Their design achieved similar performance to ours for 512 PEs, as shown in Table 3. However, our approach can reach up to 1024 PEs embedded in the Virtex-7 XC7VX485T, a lower clock frequency and, thus, double the performance in GCUPS. Despite that, the design proposed by [23] achieved the second best overall performance. Meantime, the architecture proposed in [47], using a register-file concept, achieved 128GCUPS, the best overall performance shown in Table 4.
The hardware implementation of the alignment process through our approach, developed based on a chain of directions and the maximum score address, is a key contribution for the low use of memories. Thus, as we did not carry out tests with real biological datasets, theoretically speaking, it is possible to achieve high hardware scalability. Besides, the sequence of any size can be aligned with our approach limited by the hardware resources available. In addition, the proposed method can compress the data, using only 3 bits in a fixed-point implementation.

Conclusion
This paper presented a parallel FPGA platform design to accelerate both the forward and backtracking stages of the SW algorithm. The main contributions were the high-speed data processing implementation and low memory usage that theoretically allows high scalability. The hardware resources available on the FPGA are a limiting factor to the size of the score matrix but not to the size of sequences to be aligned. Therefore, satisfying the high-throughput, ultralow-latency and low-power requirements and to alleviate the raw data processing problem in bioinformatics. From the strategy of storing alignment path distances and maximum score position during forward stage processing, it was possible to reduce the complexity of backtracking stage processing which allowed to follow the path directly. The proposal architecture achieved a satisfactory critical path, reduced memory usage and, theoretically, a high scalability for two-step SW algorithm. Synthesis results showed that the proposed method could support up to 1, 024 PEs in only one FPGA, using the Xilinx Virtex-7 XC7VX485T. The main advantage is the low hardware resource usage and high performance of 79.5 GCUPS, with an operating frequency of up to 155MHz, without using external resources.